class: inverse, center, middle background-image: url("data:image/png;base64,#img/tidyverse.png") background-position: 95% 95% background-size: 25% # Using the _tidyverse_ ### Maximilian H.K. Hesselbarth #### University of Michigan (EEB) 2022-03-25 --- # The _tidyverse_ .pull-left[ <img src="data:image/png;base64,#img/hexlogos.png" width="75%" style="display: block; margin: auto;" /> .ref[www.tidyverse.org] ] .pull-right[ <img src="data:image/png;base64,#img/tasks.png" width="100%" style="display: block; margin: auto;" /> .ref[Wickham, H., Grolemund, G., 2016. R for Data Science, 1st ed. O’Reilly, Newton (USA).] ] --- # Tidy data -- 1. Each **variable** forms a **column**. -- 2. Each **observation** forms a **row**. -- 3. Each **value** must have its own **cell**. .ref[Wickham, H., 2014. Tidy Data. Journal of Statistical Software 59, 1–23. https://doi.org/10.18637/jss.v059.i10] -- <img src="data:image/png;base64,#img/tidydata.png" width="75%" style="display: block; margin: auto;" /> .ref[Illustration from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst] --- # Tidy data <img src="data:image/png;base64,#img/table.png" width="100%" style="display: block; margin: auto;" /> .ref[Wickham, H., Grolemund, G., 2016. R for Data Science, 1st ed. O’Reilly, Newton (USA).] --- # Install and load ```r install.packages("tidyverse") ``` ```r library(tidyverse) ``` ``` ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ``` ``` ## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4 ## ✓ tibble 3.1.6 ✓ dplyr 1.0.8 ## ✓ tidyr 1.2.0 ✓ stringr 1.4.0 ## ✓ readr 2.1.2 ✓ forcats 0.5.1 ``` ``` ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() ``` --- background-image: url("data:image/png;base64,#img/tidyverse.png") background-position: 95% 10% background-size: 25% # Core packages `readr` : Read rectangular data `tibble` : Modern re-imagining of data frames `stringr` : Functions to work with strings (i.e. sequence of characters) `forcats` : Functions to modify factors (i.e. categorical data) `tidyr` : Functions to tidy/reshape data `dplyr` : Functions for data manipulation `purrr` : Functional programming `ggplot2` : Data visualization --- background-image: url("data:image/png;base64,#img/maggritr.png") background-position: 95% 10% background-size: 10% # Pipe operator `%>%` -- - `f(x)` is equivalent to `x %>% f` -- - `f(x, y)` is equivalent to `x %>% f(y)` -- - `f(y, x)` is equivalent to `x %>% f(y, .)` -- ```r set.seed(42) x <- runif(n = 10) *min(log(sort(x))) ## [1] -2.004953 *x %>% sort() %>% log() %>% min() ## [1] -2.004953 set.seed(42) *10 %>% runif() %>% sort() %>% log() %>% min() ## [1] -2.004953 ``` --- class: inverse # Palmer penguins dataset <!-- - `palmerpenguins` dataset --> <!-- - Nesting observations, penguin size data, and isotope measurements from blood samples for adult Adélie, Chinstrap, and Gentoo penguins. --> <img src="data:image/png;base64,#img/penguins.png" width="65%" style="display: block; margin: auto;" /> .ref[Horst A.M., Hill A.P., Gorman K.B., 2020. palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0.] --- background-image: url("data:image/png;base64,#img/readr.png") background-position: 95% 10% background-size: 10% # Read data: _readr_ -- - Reads automatically into `tibble` -- - Different `readr::read_*()` functions for different data types -- - Functions to write data `readr::write_*()` -- - `readxl` as alternative for Excel data -- ```r *df_penguins <- readr::read_csv("data/penguins_raw.csv") %>% magrittr::set_names(names(.) %>% stringr::str_remove_all(pattern = " ") %>% stringr::str_remove_all(pattern = "\\([^()]+\\)")) ## Rows: 344 Columns: 17 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## chr (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C... ## dbl (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng... ## date (1): Date Egg ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` --- background-image: url("data:image/png;base64,#img/tidyr.png") background-position: 95% 10% background-size: 10% # Tidy data: _tidyr_ -- - Some functions to deal with `NA` values ```r df_penguins <- tidyr::drop_na(df_penguins, BodyMass) ``` -- - `nest()`/`unnest()` to organize data ```r tidyr::nest(df_penguins, data = -studyName) ``` ``` ## # A tibble: 3 × 2 ## studyName data ## <chr> <list> ## 1 PAL0708 <tibble [109 × 16]> ## 2 PAL0809 <tibble [114 × 16]> ## 3 PAL0910 <tibble [119 × 16]> ``` -- - `pivot_longer()` & `pivot_wider()` most important functions to **reshape** data --- background-image: url("data:image/png;base64,#img/dplyr.png") background-position: 95% 10% background-size: 10% # Wrangle data: _dplyr_ -- - `filter()` to subset **rows**; `select()` to subset **columns** ```r dplyr::filter(df_penguins, BodyMass >= quantile(BodyMass, 0.75), Sex != "FEMALE") %>% dplyr::select_if(is.numeric) %>% head(5) ``` ``` ## # A tibble: 5 × 7 ## SampleNumber CulmenLength CulmenDepth FlipperLength BodyMass Delta15N Delta13C ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 110 43.2 19 197 4775 9.32 -25.5 ## 2 2 50 16.3 230 5700 8.15 -25.4 ## 3 4 50 15.2 218 5700 8.26 -25.4 ## 4 5 47.6 14.5 215 5400 8.23 -25.5 ## 5 8 46.7 15.3 219 5200 8.23 -25.4 ``` -- - `pull()` to convert **one** column as vector ```r dplyr::pull(df_penguins, FlipperLength) %>% head(10) ``` ``` ## [1] 181 186 195 193 190 181 195 193 190 186 ``` --- background-image: url("data:image/png;base64,#img/dplyr.png") background-position: 95% 10% background-size: 10% # Wrangle data: _dplyr_ -- - `mutate()` to **create/modify** columns - `case_when()` as vectorised **ifelse** statements - `slice()` to subset **rows** by position -- ```r dplyr::select(df_penguins, IndividualID, Species, CulmenLength) %>% * dplyr::mutate(CulmenLength_cm = CulmenLength / 10, CulmenLenth_class = dplyr::case_when(CulmenLength < 35 ~ "small", * CulmenLength >= 35 & CulmenLength <= 45 ~ "med", CulmenLength > 45 ~ "large")) %>% dplyr::slice(sample(1:nrow(.), size = 3)) ``` ``` ## # A tibble: 3 × 5 ## IndividualID Species CulmenLength CulmenLength_cm CulmenLenth_cla… ## <chr> <chr> <dbl> <dbl> <chr> ## 1 N62A1 Chinstrap penguin … 46.4 4.64 large ## 2 N13A1 Adelie Penguin (Py… 38.8 3.88 med ## 3 N92A1 Chinstrap penguin … 45.7 4.57 large ``` --- background-image: url("data:image/png;base64,#img/dplyr.png") background-position: 95% 10% background-size: 10% # Wrangle data: _dplyr_ -- - `group_by()` to group by **column** - `summarise()` to **summarize** each group - `n()` to **count** observations within each group (context dependent) -- ```r *df_penguins_sum <- dplyr::group_by(df_penguins, Island, Species) %>% * dplyr::summarise(n = dplyr::n(), FlipperLength_m = mean(FlipperLength), FlipperLength_sd = sd(FlipperLength), * .groups = "drop") ``` --- background-image: url("data:image/png;base64,#img/dplyr.png") background-position: 95% 10% background-size: 10% # Wrangle data: _dplyr_ -- - `*_join()` to combine columns from `x` and `y` using matching **keys** ```r *dplyr::left_join(x = df_penguins, y = df_penguins_sum, by = c("Island", "Species")) %>% dplyr::select(Species, Island, tidyselect::starts_with("Flipper")) %>% dplyr::slice(sample(1:nrow(.), size = 5)) ``` ``` ## # A tibble: 5 × 5 ## Species Island FlipperLength FlipperLength_m FlipperLength_sd ## <chr> <chr> <dbl> <dbl> <dbl> ## 1 Adelie Penguin (Pygosce… Dream 190 190. 6.59 ## 2 Gentoo penguin (Pygosce… Biscoe 213 217. 6.48 ## 3 Adelie Penguin (Pygosce… Biscoe 198 189. 6.73 ## 4 Adelie Penguin (Pygosce… Biscoe 174 189. 6.73 ## 5 Chinstrap penguin (Pygo… Dream 187 196. 7.13 ``` -- <img src="data:image/png;base64,#img/joins.png" width="30%" style="display: block; margin: auto;" /> .ref[https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti] --- background-image: url("data:image/png;base64,#img/purrr.png") background-position: 95% 10% background-size: 10% # Functional programming: _purrr_ -- - `map_*()` to apply function to each element (vector/list) ```r species_names <- unique(df_penguins$Species) *purrr::map(species_names, function(i) { dplyr::filter(df_penguins, Species == i) %>% dplyr::pull(Island) %>% unique() %>% stringr::str_sort() %>% paste(collapse = ", ")}) ``` ``` ## [[1]] ## [1] "Biscoe, Dream, Torgersen" ## ## [[2]] ## [1] "Biscoe" ## ## [[3]] ## [1] "Dream" ``` -- ```r *purrr::map_int(species_names, function(i) { dplyr::filter(df_penguins, Species == i) %>% dplyr::pull(Island) %>% unique %>% length}) ``` ``` ## [1] 3 1 1 ``` --- class: inverse ## Thank you for your attention ### Questions? .pull-left[ Further resources: [https://mhesselbarth.github.io/advanced-r-workshop/resources](https://mhesselbarth.github.io/advanced-r-workshop/resources) Exercise: [https://mhesselbarth.github.io/advanced-r-workshop/exercise_tidyverse](https://mhesselbarth.github.io/advanced-r-workshop/exercise_tidyverse) ] .pull-right[ <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M464 64H48C21.49 64 0 85.49 0 112v288c0 26.51 21.49 48 48 48h416c26.51 0 48-21.49 48-48V112c0-26.51-21.49-48-48-48zm0 48v40.805c-22.422 18.259-58.168 46.651-134.587 106.49-16.841 13.247-50.201 45.072-73.413 44.701-23.208.375-56.579-31.459-73.413-44.701C106.18 199.465 70.425 171.067 48 152.805V112h416zM48 400V214.398c22.914 18.251 55.409 43.862 104.938 82.646 21.857 17.205 60.134 55.186 103.062 54.955 42.717.231 80.509-37.199 103.053-54.947 49.528-38.783 82.032-64.401 104.947-82.653V400H48z"></path></svg> [mhessel@umich.edu](mailto:mhessel@umich.edu) <svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"></path></svg> [https://mhesselbarth.rbind.io](https://mhesselbarth.rbind.io) <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> [@MHKHesselbarth](https://twitter.com/MHKHesselbarth) ]